Household income plays a significant role in determining a country's
socioeconomic standing. Governments often use this measure to formulate
the federal budget and the policies most appropriate for national
development. Even so, Malaysia's current economic circumstances continue
to be characterized by income disparity. This shortcoming can be examined
by analyzing the household income survey (HIS) conducted by the Department
of Statistics Malaysia (DOSM). This study proposes a hybrid model that
combines K-means clustering with multiple linear regression (MLR) to
cluster and predict household income in Malaysia. Based on the experimental
results, MLR combined with K-means clustering outperformed MLR without
clustering, yielding a smaller mean square error. Clustering therefore
gives a more accurate estimate of household income because it reduces the
variation between households within each group. It is important that
household income information reflect the concern of policymakers about the
impact of universal and targeted interventions on different socioeconomic
groups.
1. INTRODUCTION

The term household income refers to all earnings of the household or its individual members,
whether monetary or in the form of in-kind goods and services, that are available for current
consumption annually or more frequently [1]. Household income has several components, including
income from paid employment or self-employment, other earned income, property income and current
transfers [2]-[4]. It is generally accepted that household income is the most significant
indicator of economic well-being, as it helps measure a household's resources for saving and
consumption [5], [6].

In addition to measuring a citizen's socioeconomic status, household income also plays an important
role in determining which policies should be implemented to promote national development [7]. The
standard of living of an area can be evaluated using this economic indicator. Moreover, household
income has been studied to determine whether policymakers were successful in addressing Malaysian
economic inequality [8], [9]. This kind of analysis is particularly helpful since it helps ensure
that poverty can be addressed effectively in a particular area. Policymakers evaluate the impact of
universal and targeted actions on different socioeconomic groups based on household income
information. Policy issues involving welfare, taxation, housing, education, the labour market,
health and other fiscal policies are influenced by data related to income distribution [6].

Furthermore, Malaysians also suffer from income inequality, which is influenced by racial
background, geographical area and various other factors. Affluent Malaysians and the rest of the
country earn markedly different amounts, with the gap widening each year. It is imperative that
unequal income distribution be addressed as soon as possible, since it undermines social cohesion
and leaves many Malaysians with an insufficient quality of life. Moreover, it is inconsistent with
the national development initiative to promote growth with equity [1]. Therefore, modelling
household income is necessary for policymakers to formulate appropriate policy based on
socioeconomic status in Malaysia.

This study applies K-means clustering together with a multiple linear regression (MLR) model to
household income in Malaysia. It aims to identify the factors that affect a household's income.
The MLR model with and without K-means clustering is then compared using the mean square error
(MSE) to find the better model. Previous studies in various fields worldwide have extended the MLR
method with the fuzzy regression technique [10]-[13].


2. PROPOSED METHOD 

Multiple linear regression (MLR) is a popular method for analyzing multivariate factors [14]-[16].
Combining MLR with other methods can enhance the robustness and accuracy of the model. Previous
studies have used K-means clustering to divide data into several disjoint groups whose members
exhibit similar characteristics. This study uses K-means with an MLR model to cluster and predict
household income in Malaysia. The purpose is to cluster the data and then identify, within each
cluster, the significant factors that affect the income of a household. After the number of
clusters is identified, the MLR analysis is conducted separately for each cluster. With the
combination of MLR and K-means clustering, the MSE is expected to be reduced. Finally, the MSE of
the MLR models with and without the K-means technique is compared in order to find the better
model.


3. RESEARCH METHOD 
3.1. Data acquisition 

The household income survey (HIS) conducted by the Department of Statistics Malaysia (DOSM) in
2012 provided detailed demographic and social information about each household. The dataset
contains 22 variables, such as income category, education level and number of household members.
A regression model for household income was formed using 12 of these variables, selected for
their importance in the MLR applied to the dataset, as listed in Table 1.


Table 1. Data description for household income

Variable   Name of variable                     Variable type
Y          gross total                          Numeric
X1         Strata                               Nominal
X2         Weightdp                             Numeric
X3         head of household age                Numeric
X4         head of household gender             Nominal
X5         head of household marital status     Nominal
X6         head of household education          Ordinal
X7         head of household certificate        Ordinal
X8         head of household activity           Nominal
X9         size of household                    Numeric
X10        Region                               Nominal
X11        occupation of head household         Nominal
X12        industry of head household           Nominal


3.2. Research methodology 
3.2.1. Multiple linear regression (MLR) model 
The MLR model for a dependent variable $Y$ with $k$ predictor variables can be written as in (1),

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + \varepsilon_i \qquad (1)$$

where $Y_i$ is the dependent variable for the $i$-th observation, $\beta_0, \beta_1, \ldots, \beta_k$
are the regression coefficients to be estimated, $X_{i1}, \ldots, X_{ik}$ are the explanatory
variables and $\varepsilon_i$ is the error term [17].

It is convenient to express the MLR model in matrix notation [18]. In matrix notation, the model
is given in (2). The least-squares estimator $\hat{\beta}$ that minimizes the sum of squared
residuals in MLR is expressed in (3).

$$Y = X\beta + \varepsilon \qquad (2)$$

$$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y \qquad (3)$$

where $Y$ is an $n \times 1$ vector of observations, $X$ is an $n \times p$ matrix of the levels of
the regressor variables, $\beta$ is a $p \times 1$ vector of the regression coefficients and
$\varepsilon$ is an $n \times 1$ vector of errors.
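
As a minimal sketch of the estimator in (3), the following pure-Python example solves the normal
equations $(X^{T}X)\hat{\beta} = X^{T}Y$ on a small made-up dataset. The function names (`solve`,
`ols`) and the toy data are illustrative assumptions, not part of the study. Solving the linear
system is used here in place of forming the inverse explicitly; both give the same $\hat{\beta}$
when $X^{T}X$ is invertible.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; the predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols(X, y):
    """Least-squares coefficients [b0, b1, ..., bk]; an intercept column is added to X."""
    design = [[1.0] + list(row) for row in X]
    p = len(design[0])
    # Build X^T X and X^T Y directly from the rows of the design matrix.
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


# Toy data (hypothetical), generated exactly from y = 1 + 2*x1 + 3*x2.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]
beta_hat = ols(X, y)
print(beta_hat)
```

Because the toy responses contain no noise, the estimate should recover the intercept 1 and the
slopes 2 and 3 up to floating-point error.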

Several key assumptions must be fulfilled before running the MLR model: multivariate normality,
absence of multicollinearity, independence, and homogeneity of variance [19], [20]. The methods
used to check these assumptions in this study are listed in Table 2.


Table 2. Method on evaluating assumptions of MLR model

Assumptions                Assessing method
Multivariate normality     Normal Probability-Probability plot
Multicollinearity          Variance Inflation Factor (VIF)
Independence               Durbin-Watson statistic
Homogeneity of variance    Box-Cox plot


Serious multicollinearity is indicated by a VIF value greater than 10. Multicollinearity can be
resolved by discarding the variables with the highest VIF, resulting in a model with little to no
multicollinearity. The VIF is given in (4) [18],

$$VIF_k = \frac{1}{1 - R_k^2} \qquad (4)$$

where $R_k^2$ is the $R^2$ value obtained by regressing the $k$-th predictor on the remaining
explanatory variables. The Durbin-Watson statistic of the model should be close to two for the
independence assumption to hold. For the Box-Cox plot used to assess homoscedasticity, the
estimated rounded lambda value should be 1, which indicates constancy in error variance [18].
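
The two numerical checks above can be sketched as follows, assuming small made-up predictor
columns. The helper names (`fit_r2`, `vif`, `durbin_watson`) and the data are hypothetical and are
not the study's code. `vif` implements (4) by regressing one predictor on the others, and
`durbin_watson` uses the standard ratio of squared successive residual differences to squared
residuals.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_r2(X, y):
    """R^2 from a least-squares regression of y on the columns of X plus an intercept."""
    design = [[1.0] + list(row) for row in X]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in design]
    mean_y = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ssto = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / ssto


def vif(X, k):
    """Equation (4): VIF of predictor k, regressed on all the other predictors."""
    target = [row[k] for row in X]
    others = [[v for j, v in enumerate(row) if j != k] for row in X]
    return 1.0 / (1.0 - fit_r2(others, target))


def durbin_watson(residuals):
    """Durbin-Watson statistic; values near 2 suggest no first-order autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    return num / sum(e * e for e in residuals)


# Hypothetical data: the third column is almost the sum of the first two.
X_collinear = [[-1, -1, -1.9], [1, -1, -0.1], [-1, 1, -0.1], [1, 1, 2.1]]
# Hypothetical data: two uncorrelated predictors.
X_independent = [[-1, -1], [1, -1], [-1, 1], [1, 1]]

vif_collinear = vif(X_collinear, 2)      # far above the threshold of 10
vif_independent = vif(X_independent, 0)  # no multicollinearity
print(vif_collinear, vif_independent)
```

In this toy case the near-duplicate column would be the candidate for removal under the VIF > 10
rule, while the uncorrelated predictors give a VIF of about 1.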

Other than that, the coefficient of multiple determination, $R^2$, is another important indicator.
It not only measures how well the model fits a set of observations but also quantifies the amount
of variation in the dependent variable that is explained by the regression equation. In short,
$R^2$ is the proportion of variation in the response variable accounted for by the independent
variables. The higher the $R^2$, the more of the variation in the dependent variable is explained
by the predictor variables, and the closer the observations fall to the fitted regression line
[20]. The formula for $R^2$ is given in (5),

$$R^2 = \frac{SSR}{SSTO} = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2} \qquad (5)$$

where SSR = regression sum of squares
SSTO = total sum of squares
$\hat{Y}_i$ = fitted value from the regression line
$\bar{Y}$ = mean of $Y$
$Y_i$ = observed value of the dependent variable
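
A short worked example of (5), assuming a made-up dataset with a single predictor so that the
least-squares line can be written in closed form. The function names and numbers are illustrative
only.

```python
def fit_line(x, y):
    """Closed-form least squares for y = b0 + b1 * x (one predictor, for illustration)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def r_squared(y, fitted):
    """Equation (5): regression sum of squares over total sum of squares."""
    mean_y = sum(y) / len(y)
    ssr = sum((f - mean_y) ** 2 for f in fitted)
    ssto = sum((yi - mean_y) ** 2 for yi in y)
    return ssr / ssto


x = [1, 2, 3, 4, 5]        # hypothetical predictor values
y = [2, 4, 5, 4, 5]        # hypothetical responses
b0, b1 = fit_line(x, y)    # least-squares line: 2.2 + 0.6 * x
fitted = [b0 + b1 * xi for xi in x]
r2 = r_squared(y, fitted)
print(round(r2, 4))        # SSR = 3.6, SSTO = 6, so R^2 = 0.6
```

Here 60% of the variation in the toy response is explained by the fitted line. The ratio SSR/SSTO
equals 1 - SSE/SSTO only when the fitted values come from a least-squares fit with an intercept, as
they do in this sketch.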


3.2.2. K-means clustering method 

K-means clustering is a useful method for reducing error rates because it assigns each observation
to the nearest cluster based on the minimum computed distance. Euclidean distance is commonly
applied in clustering analysis to classify observations. The distance between two objects, $O_i$
and $O_j$, in $p$-dimensional space is calculated using the Euclidean distance in (6) [21],

$$\text{Euclidean}(O_i, O_j) = \sqrt{\sum_{a=1}^{p} (O_{ia} - O_{ja})^2} \qquad (6)$$

where $O_i$ and $O_j$ are the $i$-th and $j$-th data objects and $p$ is the number of features. The
centroid of the $i$-th cluster is defined as in (7) [22],

$$c_i = \frac{1}{m_i} \sum_{x \in C_i} x \qquad (7)$$

where $c_i$ is the cluster centroid, $m_i$ is the number of objects in the $i$-th cluster, $x$ is a
data object and $C_i$ is the $i$-th cluster.

In this study, the optimal number of clusters for K-means is determined using the average
silhouette width. An average silhouette width close to 1 signifies that the observations are well
clustered. In K-means clustering, the calculation of centroids, the computation of distances from
the centroids and the grouping of observations are repeated iteratively until convergence is
reached [22], [23].
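
The iteration described above, together with the choice of the number of clusters by average
silhouette width, can be sketched in pure Python as follows. This is an illustrative sketch under
stated assumptions, not the software used in the study: the starting centroids are simply the first
k points, the silhouette of a single-member cluster is taken as 0, and the six two-dimensional
points are made up.

```python
import math


def centroid(points):
    """Equation (7): coordinate-wise mean of the objects in one cluster."""
    m = len(points)
    return tuple(sum(coords) / m for coords in zip(*points))


def kmeans(points, k, max_iter=100):
    """Assign points to the nearest centroid (Euclidean distance, equation (6)) until stable."""
    centroids = [tuple(p) for p in points[:k]]  # assumption: first k points as starting centroids
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        if new_labels == labels:  # convergence: no observation changed cluster
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[c] = centroid(members)
    return labels, centroids


def average_silhouette_width(points, labels):
    """Mean of (b - a) / max(a, b) over all points; needs at least two clusters."""
    clusters = sorted(set(labels))
    widths = []
    for i, p in enumerate(points):
        mean_dist = {}
        for c in clusters:
            ds = [math.dist(p, q) for j, q in enumerate(points) if labels[j] == c and j != i]
            mean_dist[c] = sum(ds) / len(ds) if ds else None
        a = mean_dist[labels[i]]  # mean distance to the other members of the point's own cluster
        if a is None:             # single-member cluster
            widths.append(0.0)
            continue
        b = min(d for c, d in mean_dist.items() if c != labels[i])  # nearest other cluster
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


# Hypothetical data: two visually separate groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels2, centroids2 = kmeans(points, 2)
scores = {k: average_silhouette_width(points, kmeans(points, k)[0]) for k in (2, 3)}
best_k = max(scores, key=scores.get)
print(labels2, best_k)
```

On these toy points the two-cluster solution has the larger average silhouette width, which mirrors
the selection rule of picking the cluster number whose width is closest to 1.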


3.3. Accuracy comparison method 
3.3.1. Mean square error (MSE) 

MSE is the average of the squared errors. It measures the distance between the observed and
predicted values [18]. The formula is written as in (8) [24], [25],

$$MSE = \frac{1}{n}\sum_{t=1}^{n}(y_t - \hat{y}_t)^2 \qquad (8)$$

where $y_t$ is the real data at time $t$, $\hat{y}_t$ is the predicted value at time $t$ and $n$ is
the number of data involved. If only two clusters are used, the MSE of the clustered model can be
calculated using (9) [24],

$$MSE_{combined} = \frac{n_1 MSE_1 + n_2 MSE_2}{n_1 + n_2} \qquad (9)$$

where $n_1$ and $n_2$ are the sample sizes in clusters 1 and 2 respectively, and $MSE_1$ and
$MSE_2$ are the mean square errors for clusters 1 and 2 respectively.
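
Equations (8) and (9) can be checked on a small made-up example. The function names and numbers
below are assumptions for illustration; `combined_mse` is written for any number of clusters and
reduces to (9) when two are given.

```python
def mse(actual, predicted):
    """Equation (8): mean of the squared differences between actual and predicted values."""
    n = len(actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n


def combined_mse(sizes, mses):
    """Equation (9): per-cluster MSE values weighted by the cluster sample sizes."""
    return sum(n * m for n, m in zip(sizes, mses)) / sum(sizes)


# Hypothetical cluster 1: three observations, one of them off by 2.
actual_1, predicted_1 = [1.0, 2.0, 3.0], [1.0, 2.0, 5.0]
# Hypothetical cluster 2: two observations, one of them off by 1.
actual_2, predicted_2 = [10.0, 12.0], [11.0, 12.0]

mse_1 = mse(actual_1, predicted_1)   # 4 / 3
mse_2 = mse(actual_2, predicted_2)   # 1 / 2
mse_combined = combined_mse([3, 2], [mse_1, mse_2])

# The weighted combination equals the MSE over all five observations pooled together.
mse_pooled = mse(actual_1 + actual_2, predicted_1 + predicted_2)
print(mse_combined, mse_pooled)
```

Weighting by sample size makes the combined value identical to the MSE computed over all
observations at once, so it is directly comparable with the MSE of a single unclustered model.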


4. RESULTS AND DISCUSSION 

The model contains one dependent variable and 12 independent variables. Prior to data analysis,
categorical independent variables were recoded into dummy variables. The dataset was then assessed
for its suitability for MLR against the assumptions of linear regression. The normal P-P plot and
Box-Cox plot in Figure 1 showed that constancy of variance and multivariate normality were not
fulfilled. Since the Box-Cox plot showed that the optimal lambda is -0.06, a power transformation
with exponent -0.06 was applied simultaneously to the dependent and independent variables. After
the Box-Cox transformation, the assumptions of linear regression were satisfied, as shown in
Figure 2.
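
A minimal sketch of a power transformation with the reported lambda of -0.06, and of the
back-transformation to the original scale, is given below. It assumes the plain power form
$y^{\lambda}$ suggested by the wording above; the study may instead have used the standard Box-Cox
form $(y^{\lambda} - 1)/\lambda$, and the income values are made up.

```python
LAMBDA = -0.06  # optimal lambda read from the Box-Cox plot in the study


def power_transform(values, lam=LAMBDA):
    """Raise each (positive) value to the power lambda."""
    return [v ** lam for v in values]


def inverse_power_transform(values, lam=LAMBDA):
    """Undo the power transformation by raising to 1 / lambda."""
    return [v ** (1.0 / lam) for v in values]


incomes = [1500.0, 3200.0, 8000.0, 25000.0]  # hypothetical gross incomes
transformed = power_transform(incomes)
recovered = inverse_power_transform(transformed)
print(transformed)
```

Under this assumed form the negative exponent compresses the long right tail of income and also
reverses the ordering, so larger incomes map to smaller transformed values; this matters when
reading the signs of coefficients fitted on the transformed scale.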

[Figure 1 image: normal P-P plot (x-axis: Observed Cum. Prob); Box-Cox plot (using 95.0%
confidence; Lower CL -0.08, Upper CL -0.05)]

Figure 1. Normal P-P plot (left) and Box-Cox plot (right) of gross total standardized residuals


The data were first fitted using the MLR model alone. The initial MLR model showed
multicollinearity and non-significance of certain predictor variables. Multicollinearity was
checked first, followed by the significance of the predictor variables. The multicollinearity
problem was solved by discarding two variables ($X_5$ and $X_{10}$) with VIF > 10. Predictors that
were not significant were also discarded since they do not contribute to the model. The MSE of the
final model is the main concern of this study, as it shows how effectively the model reduces error
rates and the variation of household income. The final MLR model consists of 10 significant
explanatory variables and has MSE = $3.08 \times 10^{-4}$, as shown in Table 3. The significant
model is given in (10).

$$Y = 0.081 - 0.009X_1 + 0.068X_2 + 0.266X_3 - 0.006X_4 + 0.000069X_6 + 0.000X_7 + 0.004X_8 + 0.194X_9 + 0.002X_{11} + 0.000X_{12} \qquad (10)$$


[Figure 2 image: Normal P-P Plot of Regression Standardized Residual; dependent variable:
Transformed gross total; axes: Observed Cum Prob (x), Expected Cum Prob (y)]

Figure 2. Normal P-P plot of the transformed gross total of standardized residuals


Table 3. Analysis of variance (ANOVA) table of MLR final model

Model          Sum of Squares   Df      Mean Square   F          Sig.
1 Regression   4.305            10      0.431         1398.008   0.0
  Residual     4.072            13221   0.000
  Total        8.377            13231


In the second step, K-means clustering analysis and the MLR model were combined. For this
combined model, the MSE value was used to assess whether the hybrid model provided an accurate
prediction. Silhouette analysis was used to determine the optimal number of clusters. The output
of the silhouette analysis is given in Table 4.


Table 4. Average silhouette width for various cluster number

Cluster number, k   Average silhouette width
2                   0.65
3                   0.44
4                   0.55
5                   0.50
6                   0.53
7                   0.54
8                   0.54
9                   0.48


Because its average silhouette width is closest to one, two was identified as the optimal number
of clusters. When applying K-means clustering, the user can choose which variables of interest
serve as clustering variables. In this study, clustering was done based on (i) the dependent
variable and (ii) all variables. All clustering variables were standardized first because they were
measured on different scales. The MSE obtained from clustering on the dependent variable is
$1.5811 \times 10^{-4}$, whereas the MSE obtained from clustering on all variables is
$2.9962 \times 10^{-4}$. The MLR model with and without K-means clustering is compared in Table 5.
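
The comparison between a single regression and the clustered approach can be sketched on made-up
data as follows. To keep the example short it makes several simplifying assumptions that differ
from the study: one predictor instead of twelve, K-means with k = 2 run on the dependent variable
only and started from its minimum and maximum, and no Box-Cox transformation. All names and numbers
are hypothetical.

```python
def fit_line(x, y):
    """Closed-form least squares for y = b0 + b1 * x (single predictor for brevity)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def mse_of_fit(x, y):
    """Fit a line to (x, y) and return the mean square error of its fitted values."""
    b0, b1 = fit_line(x, y)
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


def two_means_1d(values, max_iter=100):
    """K-means with k = 2 on one variable, started from the smallest and largest value."""
    centers = [min(values), max(values)]
    labels = None
    for _ in range(max_iter):
        new_labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1 for v in values]
        if new_labels == labels:
            break
        labels = new_labels
        for c in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels


# Hypothetical data: a low-income regime followed by a high-income regime.
x = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
y = [2.4, 3.1, 3.4, 4.1, 4.4, 5.1, 23.2, 25.9, 29.1, 31.8, 35.2, 37.9]

mse_single = mse_of_fit(x, y)              # one regression over all observations

labels = two_means_1d(y)                   # cluster on the dependent variable
parts = []
for c in (0, 1):
    xc = [xi for xi, lab in zip(x, labels) if lab == c]
    yc = [yi for yi, lab in zip(y, labels) if lab == c]
    parts.append((len(yc), mse_of_fit(xc, yc)))

# Size-weighted combination of the per-cluster MSE values, as in equation (9).
mse_hybrid = sum(n * m for n, m in parts) / sum(n for n, _ in parts)
print(mse_single, mse_hybrid)
```

In this toy setting the per-cluster regressions give a much smaller combined MSE than the single
regression, because each cluster covers a narrower range of the response.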

Based on Table 6, four independent variables were excluded for each of clusters 1 and 2, owing to
multicollinearity ($X_5$ and $X_{10}$) or a coefficient of nearly 0. The remaining 8 independent
variables are included in the model in Table 6 for both clusters. The important contributors to
household income for cluster 1 according to this model are strata ($X_1$), weight density
population ($X_2$), head of household age ($X_3$), head of household gender ($X_4$), head of
household education ($X_6$), head of household certificate ($X_7$), head of household activity
($X_8$) and household size ($X_9$). For cluster 2, the 8 important contributors are strata ($X_1$),
weight density population ($X_2$), head of household age ($X_3$), head of household gender ($X_4$),
head of household certificate ($X_7$), head of household activity ($X_8$), household size ($X_9$)
and head of household occupation ($X_{11}$).

K-means clustering analysis and multiple linear regression model on household income in ... (Gan Pei Yee)

736 0 ISSN: 2252-8938


Table 5. Comparison on performance of different models

                                      MLR only         MLR + K-Means Clustering   MLR + K-Means Clustering
                                                       (Dependent Variable)       (All Variables)
MSE (after transformation)            3.0799 x 10^-4   1.5811 x 10^-4             2.9962 x 10^-4
MSE (transformed back to original)    2.2605 x 10°     2.0053 x 10°               2.2498 x 10°


Table 6. Summary model of K-means in MLR model

Cluster     MLR model                                                              MSE
Cluster 1   Y = 0.337 - 0.003 X1 + 0.032 X2 + 0.104 X3 - 0.003 X4 + 3.675E-5 X6   1.5740 x 10^-4
                + 0.001 X7 + 0.002 X8 + 0.114 X9
Cluster 2   Y = 0.229 - 0.005 X1 + 0.025 X2 + 0.219 X3 - 0.002 X4 + 0.001 X7      1.5870 x 10^-4
                + 0.002 X8 + 0.096 X9 + 0.002 X11


The combination of K-means clustering analysis and the MLR model is more effective in reducing
error rates than the MLR model without K-means clustering, since it provides a smaller MSE. Because
the model with the dependent variable as the clustering variable gives the smaller MSE, its
significant predictors are examined. For cluster 1, the transformed gross total household income is
positively related to weight density population ($X_2$), head of household age ($X_3$), head of
household education ($X_6$), head of household certificate ($X_7$), head of household activity
($X_8$) and household size ($X_9$), and negatively related to strata ($X_1$) and head of household
gender ($X_4$). For cluster 2, the transformed gross total household income is positively related
to weight density population ($X_2$), head of household age ($X_3$), head of household certificate
($X_7$), head of household activity ($X_8$), household size ($X_9$) and occupation of head of
household ($X_{11}$), and negatively related to strata ($X_1$) and head of household gender
($X_4$).

The contributors to household income, whether demographic or geographical, materially affect the
earnings of a household. The government can use this information to formulate suitable policies so
that all Malaysians can enjoy a high standard of living. Information obtained through statistical
analysis is important to decision makers in optimizing the economic situation.


5. CONCLUSION 

Malaysian household income can be significantly influenced by geographic and demographic
characteristics. Considering the income gap in Malaysia, it is crucial that the data be modelled
using an appropriate technique to minimize the error rates arising from that gap. The MLR model
helps to analyze the contributing factors effectively using a statistical approach. To attain more
accurate and reliable results, however, clustering analysis such as the K-means approach can be
applied before fitting the MLR model, which effectively reduces the variability in household income
caused by the income gap. It is recommended that more potential contributors to household income,
for example expenditures, be incorporated into the model to make it more reliable. Future
researchers can also consider other clustering techniques, such as the fuzzy c-means approach, to
obtain the model with the lowest error rates. In addition, researchers can choose different
variables as clustering variables where appropriate to obtain further insights.